Abstract— Classification Data Mining (DM) techniques can be a
very useful tool for detecting and identifying e-banking
phishing websites. In this paper, we present a novel approach
to overcome the difficulty and complexity of detecting and
predicting e-banking phishing websites. We propose an
intelligent, resilient and effective model based on
association and classification Data Mining algorithms. These
algorithms were used to characterize and identify the factors
and rules needed to classify a phishing website, and the
relationships that correlate them with each other. We
implemented six different classification algorithms and
techniques to extract criteria from the phishing training data
sets and classify website legitimacy. We also compared their
performance in terms of accuracy, number of rules generated and
speed. A phishing case study was applied to illustrate the
website phishing process. The rules generated from the
associative classification model showed the relationship
between some important characteristics, such as the URL and
Domain Identity and the Security and Encryption criteria, and
the final phishing detection rate. The experimental results
demonstrated the feasibility of using Associative Classification
techniques in real applications and their better performance
compared to other traditional classification algorithms.


Keywords— Phishing, Fuzzy Logic, Data 
Classification, Association, e-banking risk assessment. 
I. INTRODUCTION


A number of users purchase products online and make payments
through e-banking. Some e-banking websites ask users to provide
sensitive data such as a username, password or credit card
details, often for malicious reasons. This type of e-banking
website is known as a phishing website. In order to detect and
predict e-banking phishing websites, we propose an intelligent,
flexible and effective system based on classification Data
Mining algorithms. We implemented classification algorithms and
techniques to extract criteria from the phishing data sets and
classify website legitimacy. An e-banking phishing website can
be detected based on some important characteristics, such as the
URL and Domain Identity and the Security and Encryption
criteria, which affect the final phishing detection rate. When
a user makes an online transaction and pays through an
e-banking website, our system uses a data mining algorithm to
detect whether the e-banking website is a phishing website or
not. This application can be used by many e-commerce
enterprises to make the whole transaction process secure. The
data mining algorithm used in this system provides better
performance compared to other traditional classification
algorithms. With the help of this system, users can also
purchase products online without hesitation.


Website phishing is a semantic attack that targets the user
rather than the computer. It is a relatively new Internet crime
in comparison with other forms, e.g., viruses and hacking. The
phishing problem is hard because it is very easy for an
attacker to create an exact replica of a legitimate banking
site, which looks very convincing to users. The word
“phishing” in the phrase “website phishing” is a variation on
the word “fishing”. The idea is that bait is thrown out in the
hope that a user will grab it and bite into it, just like a
fish. In most cases, the bait is either an e-mail or an instant
messaging site, which takes the user to a hostile phishing
website [7]. The motivation behind this study is to create a
resilient and effective method that uses Data Mining algorithms
and tools to detect e-banking phishing websites with an
Artificial Intelligence technique. Association and
classification algorithms can be very useful in predicting
phishing websites. They can tell us which e-banking phishing
website characteristics and indicators are most important and
how they relate to each other. Comparing different Data Mining
classification and association methods and techniques is also a
goal of this investigation, since only a few studies compare
different data mining techniques in predicting phishing
websites.


II. SYSTEM ANALYSIS
Phishing Characteristics and Indicators 


We managed to gather 27 phishing features and indicators and
clustered them into six criteria (URL & Domain Identity,
Security & Encryption, Source Code & Java script, Page Style &
Contents, Web Address Bar and Social Human Factor). Each
criterion has its own phishing components. For example, the
URL & Domain Identity criterion has five phishing indicator
components (using IP address, abnormal request URL, abnormal URL
of anchor, abnormal DNS record and abnormal URL).


The full list is shown in Table 1, which is used later in our
analysis and methodology study.
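As a minimal sketch, the grouping of indicators into criteria can be held in a simple lookup structure. The names below are copied from the criteria and indicators listed in this paper; the dictionary layout and the helper `criterion_of` are illustrative assumptions, not part of the described system.

```python
# Criteria and indicator names as listed in this paper (Table 1).
# The dictionary layout and the helper function are assumptions
# made for illustration only.
PHISHING_CRITERIA = {
    "URL & Domain Identity": [
        "Using IP address",
        "Abnormal request URL",
        "Abnormal URL of anchor",
        "Abnormal DNS record",
        "Abnormal URL",
    ],
    "Security & Encryption": [
        "Using SSL certificate (Padlock Icon)",
        "Certificate authority",
        "Abnormal cookie",
        "Distinguished names certificate",
    ],
    "Source Code & Java script": [
        "Redirect pages",
        "Straddling attack",
        "Pharming attack",
        "OnMouseOver to hide the Link",
    ],
    "Page Style & Contents": [
        "Spelling errors",
        "Copying website",
        "Using forms with Submit button",
        "Using pop-ups windows",
        "Disabling right-click",
    ],
    "Web Address Bar": [
        "Long URL address",
        "Replacing similar char for URL",
        "Adding a prefix or suffix",
        "Using the @ Symbol to confuse",
    ],
    "Social Human Factor": [
        "Emphasis on security",
        "Public generic salutation",
        "Buying time to access accounts",
    ],
}


def criterion_of(indicator):
    """Return the criterion an indicator belongs to, or None if unknown."""
    for criterion, indicators in PHISHING_CRITERIA.items():
        if indicator in indicators:
            return criterion
    return None
```

For example, `criterion_of("Abnormal DNS record")` returns `"URL & Domain Identity"`.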


Data Mining has been described as "the nontrivial extraction of
implicit, unknown, and potentially useful information from large
data sets" [18]. It is a powerful new technology that helps
researchers focus on the important information in their data
archives. Data mining tools predict future trends and
behaviors, allowing businesses to make proactive,
knowledge-driven decisions [19].


Despite growing efforts to educate users and create better 
detection tools, users are still very susceptible to phishing attacks. 
Unfortunately, due to the nature of the attacks, it is very difficult 
to estimate the number of people who actually fall victim. 


Table 1. Main Phishing Indicators with Criteria

Criteria                    N   Phishing Indicators

URL & Domain Identity       1   Using IP address
                            2   Abnormal request URL
                            3   Abnormal URL of anchor
                            4   Abnormal DNS record
                            5   Abnormal URL

Security & Encryption       1   Using SSL certificate (Padlock Icon)
                            2   Certificate authority
                            3   Abnormal cookie
                            4   Distinguished names certificate

Source Code & Java script   1   Redirect pages
                            2   Straddling attack
                            3   Pharming attack
                            4   OnMouseOver to hide the Link

Page Style & Contents       1   Spelling errors
                            2   Copying website
                            3   Using forms with Submit button
                            4   Using pop-ups windows
                            5   Disabling right-click

Web Address Bar             1   Long URL address
                            2   Replacing similar char for URL
                            3   Adding a prefix or suffix
                            4   Using the @ Symbol to confuse

Social Human Factor         1   Emphasis on security
                            2   Public generic salutation
                            3   Buying time to access accounts


III. SYSTEM CONSTRUCTION


CBA and MCAR experiments were conducted using an implementation
provided by the authors of [20], [26]. We chose these
algorithms based on the different strategies they use to
generate rules, and because their learnt classifiers are easily
understood by humans.


[Block diagram with the components: Phishing websites archiving
details, Preprocessor, Records, Configuration parameters, Data
Miner (Associative Algorithms - Apriori), Associative
Classification Techniques, Website Phishing Rate]

Figure 1. AC Model for Detecting Phishing Websites


We used two web access archives, one from the APWG archive [1]
and one from the Phishtank archive [24]. We managed to extract
all 27 phishing security features and indicators and clustered
them into their six corresponding criteria, as shown in Table 1.


Mining e-Banking Phishing Challenges 


The age of the dataset is the most significant problem, and it
is particularly relevant to the phishing corpus. E-banking
phishing websites are short-lived, often lasting only on the
order of 48 hours. Some of our features therefore cannot be
extracted from older websites, which makes our tests difficult.
The average phishing site stays live for approximately 2.25 days
[14]. Furthermore, the process of transforming the original
e-banking phishing website archives into record feature row data
sets is not without error. It requires the use of heuristics at
several steps. Thus, high accuracy from the data mining
algorithms cannot be expected. However, the evidence supporting
the golden nuggets comes from a number of different algorithms
and feature sets, and we believe it is compelling [15].


Classification Algorithm 


We utilize six common DM classification algorithms (C4.5, JRip,
PART, PRISM, CBA and MCAR). Our choice of these methods is
based on the different strategies they use in learning rules
from data sets [17]. The C4.5 algorithm [23] employs a
divide-and-conquer approach, and the RIPPER algorithm (JRip in
Weka) uses a separate-and-conquer approach. The PART algorithm
was chosen because it combines both approaches to generate a
set of rules: it adopts separate-and-conquer to generate the
rules and uses divide-and-conquer to build partial decision
trees. PRISM is a classification rule learner that can only
deal with nominal attributes and does not do any pruning. It
implements a top-down (general to specific) sequential-covering
algorithm that employs a simple accuracy-based metric to pick an
appropriate rule antecedent during rule construction. The CBA
algorithm employs association rule mining [20] to learn the
classifier and then adds pruning and prediction steps. Finally,
the MCAR algorithm consists of two phases: rule generation and
classifier building. In the first phase, MCAR scans the training
data set to discover frequent single items, and then recursively
combines the items generated to produce items involving more
attributes. MCAR then generates, ranks and stores the rules. In
the second phase, the rules are used to generate a classifier by
considering their effectiveness on the training data set. This
results in a classification approach named associative
classification [6] [7]. MCAR uses database coverage pruning to
decrease the number of rules, since without constraints on rule
discovery the very large number of rules makes the classifier
hard for humans to understand. This pruning technique tests the
generated rules against the training data set, and only
high-quality rules that cover at least one training instance not
considered by other higher-ranked rules are kept for later
classification.
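The two phases described above can be sketched in a few lines. This is a simplified illustration, not the MCAR implementation used in the experiments: brute-force enumeration stands in for MCAR's recursive item combination, the ranking order (confidence, then support, then rule length) is an assumption, a rule is taken to "cover" a record when its antecedent matches, and the thresholds and toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(records, labels, min_support=0.5, min_confidence=0.6, max_len=2):
    """Phase 1 (simplified): enumerate and rank rules of the form itemset -> class."""
    n = len(records)
    itemset_counts = Counter()
    rule_counts = Counter()
    for record, label in zip(records, labels):
        items = sorted(record.items())
        for size in range(1, max_len + 1):
            for itemset in combinations(items, size):
                itemset_counts[itemset] += 1
                rule_counts[(itemset, label)] += 1
    rules = []
    for (itemset, label), count in rule_counts.items():
        support = count / n
        confidence = count / itemset_counts[itemset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, label, support, confidence))
    # Assumed ranking: higher confidence, higher support, shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0]), r[0], r[1]))
    return rules


def database_coverage(rules, records):
    """Phase 2 (simplified): keep a rule only if it covers at least one
    training record not already covered by a higher-ranked kept rule."""
    uncovered = set(range(len(records)))
    kept = []
    for rule in rules:
        itemset = rule[0]
        covered = {i for i in uncovered
                   if all(records[i].get(attr) == value for attr, value in itemset)}
        if covered:
            kept.append(rule)
            uncovered -= covered
        if not uncovered:
            break
    return kept


# Hypothetical toy training set using criterion-level values.
records = [
    {"Web_Address_Bar": "Fraud", "Social_Human_Factor": "Fraud"},
    {"Web_Address_Bar": "Fraud", "Social_Human_Factor": "Fraud"},
    {"Web_Address_Bar": "Genuine", "Social_Human_Factor": "Genuine"},
    {"Web_Address_Bar": "Genuine", "Social_Human_Factor": "Fraud"},
]
labels = ["Phishing", "Phishing", "Legitimate", "Legitimate"]

rules = mine_rules(records, labels)
kept = database_coverage(rules, records)
```

On this toy data, four rules pass the thresholds and pruning keeps two of them, which illustrates how database coverage reduces the rule set while still covering every training record.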


A sample of the rules generated by the MCAR (Multi Class
Classification based on Association Rule) algorithm, which had
an accuracy of 88.4 % and an error rate of 12.622 %, is shown
below.


Rule 1: 
Social_Human_Factor = Fraud 
Web_Address_Bar = Fraud 
Page_Style_&_Contents = Doubtful 
->class = Phishing 
Rule 16: 
Web_Address_Bar = Genuine 
Security_&_Encryption = Doubtful 
URL_Domain_Identity = Doubtful 
->class = Legitimate 
Rule 22: 
Social_Human_Factor = Genuine 
Page_Style_&_Contents = Doubtful 
->class = Suspicious 
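A minimal sketch of how rules of this form could be applied to a website record: the three rules are copied from the sample above, while the first-match strategy, the `default` fallback and the function name `classify` are assumptions made for illustration.

```python
# The three sample rules above, as (antecedent, class) pairs in ranked order.
RULES = [
    ({"Social_Human_Factor": "Fraud",
      "Web_Address_Bar": "Fraud",
      "Page_Style_&_Contents": "Doubtful"}, "Phishing"),       # Rule 1
    ({"Web_Address_Bar": "Genuine",
      "Security_&_Encryption": "Doubtful",
      "URL_Domain_Identity": "Doubtful"}, "Legitimate"),       # Rule 16
    ({"Social_Human_Factor": "Genuine",
      "Page_Style_&_Contents": "Doubtful"}, "Suspicious"),     # Rule 22
]


def classify(site, rules=RULES, default=None):
    """Return the class of the first rule whose antecedent matches the site.

    First-match prediction and the default fallback are assumptions.
    """
    for antecedent, label in rules:
        if all(site.get(attr) == value for attr, value in antecedent.items()):
            return label
    return default


site = {
    "Social_Human_Factor": "Fraud",
    "Web_Address_Bar": "Fraud",
    "Page_Style_&_Contents": "Doubtful",
    "Security_&_Encryption": "Genuine",
}
prediction = classify(site)
```

Here `prediction` is `"Phishing"` because the record satisfies every condition of Rule 1; a record matching no rule falls back to `default`.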


We recorded the prediction accuracy and the number of rules
generated by the traditional classification algorithms and by
the associative classification approaches in Tables 5 and 6,
respectively. Experiments were conducted using stratified
ten-fold cross-validation.
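Stratified ten-fold cross-validation can be sketched as follows. This is a plain-Python illustration and not the tooling used in the experiments; `stratified_folds`, `cross_validate` and the majority-class baseline are hypothetical names introduced here, and the 100-record label list is made up.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def cross_validate(records, labels, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the accuracy."""
    correct = 0
    for test_indices in stratified_folds(labels, k, seed):
        held_out = set(test_indices)
        train_x = [r for i, r in enumerate(records) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct += sum(predict_fn(model, records[i]) == labels[i]
                       for i in test_indices)
    return correct / len(records)


# Hypothetical data and a majority-class baseline as the "classifier".
labels = ["Phishing"] * 50 + ["Legitimate"] * 30 + ["Suspicious"] * 20
records = [{} for _ in labels]


def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, record):
    return model


folds = stratified_folds(labels)
baseline_accuracy = cross_validate(records, labels, train_majority, predict_majority)
```

Each fold receives the same class mix as the full data set, so every classifier is tested on held-out records that reflect the overall class distribution.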


Table 5. Results from the Four Weka Classifiers

                         C4.5          PART          JRip          PRISM

Test mode                10-fold cross-validation
Attributes               URL Domain Identity, Security & Encryption,
                         Source Code & Java script, Page Style & Contents,
                         Web Address Bar, Social Human Factor
No. of rules             57            38            14            155
Correctly classified     848 (84.2 %)  869 (86.3 %)  818 (81.3 %)  855 (84.9 %)
Incorrectly classified   158 (15.7 %)  137 (13.6 %)  188 (18.6 %)  141 (14.0 %)
Instances                1006          1006          1006          1006
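The percentages in Table 5 can be recomputed from the counts. The short check below copies the counts from the table; that the reported figures are truncated rather than rounded to one decimal place is an inference from the numbers, not something the paper states. The check also shows that the PRISM counts sum to 996 rather than 1006; the text does not explain the remaining 10 instances.

```python
# Counts copied from Table 5; every classifier was run on 1006 instances.
INSTANCES = 1006
TABLE5 = {
    "C4.5": {"correct": 848, "incorrect": 158},
    "PART": {"correct": 869, "incorrect": 137},
    "JRip": {"correct": 818, "incorrect": 188},
    "PRISM": {"correct": 855, "incorrect": 141},
}


def truncated_percent(count, total=INSTANCES):
    """Percentage truncated (not rounded) to one decimal place."""
    return (1000 * count) // total / 10


def unaccounted(name):
    """Instances that are neither correctly nor incorrectly classified."""
    row = TABLE5[name]
    return INSTANCES - row["correct"] - row["incorrect"]


for name, row in TABLE5.items():
    print(name,
          truncated_percent(row["correct"]),
          truncated_percent(row["incorrect"]),
          unaccounted(name))
```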
Table 6. Results from CBA and MCAR Classifiers

                      CBA    MCAR

Num of Test Case
Correct Prediction    873    886
Level Limit
Number of rules


IV. CONCLUSION


The most important way to protect users from phishing attacks
is education and awareness. Internet users must be aware of the
security tips given by experts. Every user should also be
trained not to blindly follow links to websites where they have
to enter sensitive information. It is essential to check the
URL before entering a website. Phishing has become a serious
network security problem, causing financial losses of billions
of dollars to both consumers and e-commerce companies. Perhaps
more importantly, phishing has made e-commerce distrusted and
less attractive to normal consumers.


The e-banking phishing website model based on classification
data mining showed the significance of two phishing website
criteria, (URL & Domain Identity) and (Security & Encryption),
in the final phishing detection rate. It also showed the
trivial influence of some other criteria, such as ‘Page Style &
Contents’ and ‘Social Human Factor’, on the final phishing rate.
The rules generated from the associative classification model
showed the correlation between some of their characteristics,
which can help us in building a phishing website detection
system. The experiments demonstrate the feasibility of using
Associative Classification techniques in real applications
involving large databases and their better performance compared
to other traditional classification algorithms.


In the future, the system can be upgraded to automatically
detect the web page and the compatibility of the application
with the web browser. Additional work can also be done by
adding other characteristics for distinguishing fake web pages
from legitimate ones. The Phish Checker application can also
be upgraded into a mobile web application for detecting
phishing on the mobile platform.